{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Quantization Tutorial.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).\n",
    "\n",
    "# Model Quantization Tutorial\n",
    "\n",
    "This tutorial demonstrates how to use AWQ (Activation-aware Weight Quantization) to compress large language models while maintaining performance.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "❗**NOTICE:** Model quantization requires a GPU. If running on Google Colab, you must use a GPU runtime (Colab Menu: `Runtime` -> `Change runtime type` -> Select `T4 GPU` or better).\n",
    "\n",
    "⚠️ **DEVELOPMENT STATUS**: The quantization feature is currently under active development. Some features may change in future releases.\n",
    "\n",
    "First, let's install Oumi with GPU support and the required quantization libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install oumi[gpu,quantization]\n",
    "%pip install triton==3.0.0  # Required for AWQ inference compatibility"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Basic AWQ Quantization\n",
    "\n",
    "Let's start by quantizing TinyLlama to 4-bit using AWQ:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting AWQ quantization...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2025-08-04 18:02:59,116][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:83] Starting AWQ quantization pipeline...\n",
      "[2025-08-04 18:02:59,117][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:115] Loading model for AWQ quantization: TinyLlama/TinyLlama-1.1B-Chat-v1.0\n",
      "[2025-08-04 18:02:59,118][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:118] 📥 Loading base model...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "33ed6979383d49fdb1c6a19ebcaaaaef",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2025-08-04 18:03:00,147][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:135] 🔧 Configuring AWQ quantization parameters...\n",
      "[2025-08-04 18:03:00,148][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:152] ⚙️  AWQ config: {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}\n",
      "[2025-08-04 18:03:00,148][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:153] 📊 Using 32 calibration samples\n",
      "[2025-08-04 18:03:00,149][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:154] 🧮 Starting AWQ calibration and quantization...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Repo card metadata block was not found. Setting CardData to empty.\n",
      "WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.\n",
      "Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors\n",
      "AWQ: 100%|██████████| 22/22 [01:13<00:00,  3.32s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2025-08-04 18:04:16,220][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:89] PyTorch format requested. Saving AWQ model...\n",
      "[2025-08-04 18:04:16,695][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:96] ✅ AWQ quantization successful! Saved as PyTorch format.\n",
      "[2025-08-04 18:04:16,697][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:97] 📊 Quantized size: 734.0 MB\n",
      "[2025-08-04 18:04:16,697][oumi][rank0][pid:195158][MainThread][INFO]][awq_quantizer.py:98] 💡 Use this model with: AutoAWQForCausalLM.from_quantized('tinyllama_awq_4bit_tutorial')\n",
      "\n",
      "✅ Quantization complete!\n",
      "Original size (fp16): 2.20GB\n",
      "Quantized size (4-bit): 0.72GB\n",
      "Compression ratio: 3.1x\n",
      "Size reduction: 67.4%\n"
     ]
    }
   ],
   "source": [
    "from oumi.core.configs import ModelParams, QuantizationConfig  # type: ignore\n",
    "from oumi.quantize import quantize  # type: ignore\n",
    "\n",
    "# Configure quantization\n",
    "config = QuantizationConfig(\n",
    "    model=ModelParams(model_name=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"),\n",
    "    method=\"awq_q4_0\",  # 4-bit AWQ quantization\n",
    "    output_path=\"tinyllama_awq_4bit_tutorial\",\n",
    "    calibration_samples=32,  # Number of calibration samples\n",
    "    # 32 for fast testing, 1024 for better accuracy\n",
    ")\n",
    "\n",
    "# Run quantization\n",
    "print(\"Starting AWQ quantization...\")\n",
    "result = quantize(config)\n",
    "\n",
    "# Calculate sizes and compression\n",
    "original_size_gb = 2.2  # TinyLlama 1.1B in fp16\n",
    "quantized_size_gb = result.quantized_size_bytes / (1024**3)  # type: ignore\n",
    "compression_ratio = original_size_gb / quantized_size_gb\n",
    "\n",
    "print(\"\\n✅ Quantization complete!\")\n",
    "print(f\"Original size (fp16): {original_size_gb:.2f}GB\")\n",
    "print(f\"Quantized size (4-bit): {quantized_size_gb:.2f}GB\")\n",
    "print(f\"Compression ratio: {compression_ratio:.1f}x\")\n",
    "size_reduction_pct = (original_size_gb - quantized_size_gb) / original_size_gb * 100\n",
    "print(f\"Size reduction: {size_reduction_pct:.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Using the Quantized Model\n",
    "\n",
    "Now let's load and use the quantized model for inference:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading AWQ model from: tinyllama_awq_4bit_tutorial\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Replacing layers...: 100%|██████████| 22/22 [00:03<00:00,  5.76it/s]\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4bb6be8ad12d41b1b2b2fbb56ebccd85",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/509 [00:00<?, ?w/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Model loaded! GPU memory: 0.03GB\n"
     ]
    }
   ],
   "source": [
    "import torch  # type: ignore\n",
    "from awq import AutoAWQForCausalLM  # type: ignore\n",
    "from transformers import AutoTokenizer  # type: ignore\n",
    "\n",
    "# Load the quantized model\n",
    "model_path = \"tinyllama_awq_4bit_tutorial\"\n",
    "\n",
    "print(f\"Loading AWQ model from: {model_path}\")\n",
    "model = AutoAWQForCausalLM.from_quantized(\n",
    "    model_path,\n",
    "    fuse_layers=False,  # Disable layer fusion to avoid compatibility issues\n",
    "    device_map=\"auto\",\n",
    ")\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\")\n",
    "\n",
    "if tokenizer.pad_token is None:\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "\n",
    "print(f\"✅ Model loaded! GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f}GB\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: Explain the benefits of model quantization in simple terms:\n",
      "\n",
      "Generating response...\n",
      "Response:\n",
      "Explain the benefits of model quantization in simple terms:\n",
      "\n",
      "Model quantization is the process of compressing a neural network model into a smaller number of parameters without sacrificing performance. Here are some benefits of model quantization:\n",
      "\n",
      "1. Improved Model Size: Model quantization reduces the model's size, which can be beneficial for storage and transmission.\n",
      "\n",
      "2. Faster Training: Model quantization can lead to faster training, especially for smaller models.\n",
      "\n",
      "3. Reduced Inference Time: With less parameters, inference time can be reduced, leading to faster and more accurate inference.\n",
      "\n",
      "4. Improved Convergence: Model quantization can improve convergence, as the model's parameters are more accurately represented and optimized.\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Test inference\n",
    "prompt = \"Explain the benefits of model quantization in simple terms:\"\n",
    "\n",
    "# Tokenize\n",
    "inputs = tokenizer(prompt, return_tensors=\"pt\").to(\"cuda\")\n",
    "\n",
    "# Generate\n",
    "print(f\"Prompt: {prompt}\\n\")\n",
    "print(\"Generating response...\")\n",
    "\n",
    "with torch.no_grad():\n",
    "    outputs = model.generate(\n",
    "        **inputs,\n",
    "        max_new_tokens=150,\n",
    "        temperature=0.7,\n",
    "        do_sample=True,\n",
    "        pad_token_id=tokenizer.pad_token_id,\n",
    "    )\n",
    "\n",
    "# Decode and print response\n",
    "response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
    "print(f\"Response:\\n{response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Advanced Configuration\n",
    "\n",
    "AWQ offers several configuration options for fine-tuning the quantization process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Configuration:\n",
      "- Output format: safetensors\n",
      "- Calibration samples: 1024\n",
      "- Group size: 128\n",
      "- AWQ version: GEMM\n",
      "- Zero point: True\n"
     ]
    }
   ],
   "source": [
    "# Advanced AWQ configuration with more calibration samples\n",
    "advanced_config = QuantizationConfig(\n",
    "    model=ModelParams(model_name=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"),\n",
    "    method=\"awq_q4_0\",\n",
    "    output_path=\"tinyllama_awq_advanced.safetensors\",\n",
    "    output_format=\"safetensors\",  # Use SafeTensors format\n",
    "    # AWQ-specific parameters\n",
    "    calibration_samples=1024,  # More samples for better calibration\n",
    "    awq_group_size=128,  # Weight grouping size\n",
    "    awq_version=\"GEMM\",  # AWQ kernel version (GEMM is faster)\n",
    "    awq_zero_point=True,  # Use zero-point quantization\n",
    ")\n",
    "\n",
    "print(\"Configuration:\")\n",
    "print(f\"- Output format: {advanced_config.output_format}\")\n",
    "print(f\"- Calibration samples: {advanced_config.calibration_samples}\")\n",
    "print(f\"- Group size: {advanced_config.awq_group_size}\")\n",
    "print(f\"- AWQ version: {advanced_config.awq_version}\")\n",
    "print(f\"- Zero point: {advanced_config.awq_zero_point}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned how to:\n",
    "\n",
    "1. ✅ Quantize models using AWQ to 4-bit precision\n",
    "2. ✅ Load and use AWQ quantized models for inference\n",
    "3. ✅ Configure AWQ parameters for better quality\n",
    "\n",
    "\n",
    "### Key Benefits of AWQ:\n",
    "- **Memory Efficiency**: ~75% reduction in model size\n",
    "- **Speed**: Faster inference due to reduced memory bandwidth\n",
    "- **Quality**: Minimal performance degradation\n",
    "- **Compatibility**: Works with most transformer models\n",
    "\n",
    "Happy quantizing! 🚀"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi-latest",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
